Mining, Classification, Prediction

Kenneth Imphean ksi238

Introduction

The two datasets I have used before on Project 1. I have chosen County Health Rankings in Texas and COVID-19 vaccinations by county in Texas. The variables contained in the County Health Rankings in Texas come partially from my Biostatistics project: county, food environment index, primary care physician population, and urban/rural classification. The variables contained in COVID-19 vaccinations by county in Texas deal with counties, the population who is eligible for vaccination (people 12 and older), and population of fully vaccinated individuals. The food environment index was calculated by measuring access to healthy foods and the income of counties and primary care physician population was found by looking at ratios of population to primary care physicians as well as including D.O.s. Population data was aquired by the 2019 US Census Bureau Population Estimates and the population who have been fully vaccinated or partially vaccinated was collected through vaccination records submitted by health care providers.

This data is interesting to me because I want to see if there’s any sort of relationship to be found within between the two datasets. The food environment index takes into account access to supermarkets and grocery stores, which usually also have pharmacies as well where vaccinations can be given. There are 254 observations total. There are some observations with NA values, when they are removed the total is 232. Of those 254 observations, 172 are rural and 82 are urban! Of those the set with no NA values, there are 156 that are rural and 76 that are urban.

library(tidyverse)
countyhealth <- read_csv("~/Data from County Health Rankings and Biostats project.csv")
countyvaccines <- read_csv("~/COVID-19 Vaccine Data by County.csv")
combinedcounty <- inner_join(countyhealth, countyvaccines, by = "county")
combinedcountyfinal <- combinedcounty %>% mutate(percentage.fully.vaccinated = (people.fully.vaccinated/population.12.and.older) * 
    100, percentage.vaccinated.with.at.least.one.dose = (people.vaccinated.with.at.least.one.dose/population.12.and.older) * 
    100) %>% select(-total.doses.allocated, -vaccine.doses.administered, 
    -people.fully.vaccinated, -people.vaccinated.with.at.least.one.dose)
combinedcountyfinal <- combinedcountyfinal %>% mutate(urban = ifelse(urban.rural.classification == 
    "Urban", 1, 0))
combinedcountyfinalnona <- combinedcountyfinal %>% na.omit()
combinedcountyfinalnona %>% group_by(urban.rural.classification) %>% 
    summarize(count = n())
## # A tibble: 2 x 2
##   urban.rural.classification count
##   <chr>                      <int>
## 1 Rural                        156
## 2 Urban                         76

Cluster Analysis

library(cluster)
clust_dat <- combinedcountyfinal %>% dplyr::select(food.environment.index, 
    percentage.fully.vaccinated, primary.care.physician.population) %>% 
    na.omit()

clust_dat <- clust_dat %>% na.omit() %>% scale

sil_width <- vector()
for (i in 2:10) {
    pam_fit <- pam(clust_dat, k = i)
    sil_width[i] <- pam_fit$silinfo$avg.width
}
ggplot() + geom_line(aes(x = 1:10, y = sil_width)) + scale_x_continuous(name = "k", 
    breaks = 1:10)

final <- clust_dat %>% as.data.frame
pam1 <- final %>% pam(4)
pam1
## Medoids:
##       ID food.environment.index percentage.fully.vaccinated
## [1,]  23             -0.0933065                  -0.6051163
## [2,]  33              0.7322188                   0.2688540
## [3,] 200             -1.8360822                   0.5711513
## [4,] 205              0.7322188                   2.0288380
##      primary.care.physician.population
## [1,]                        -0.1934691
## [2,]                        -0.2515207
## [3,]                        -0.2732901
## [4,]                         3.6379412
## Clustering vector:
##   [1] 1 2 1 3 1 2 2 2 1 2 1 1 1 4 2 1 2 1 2 2 2 3 1 2 2 2 2 1 3 1 1 2 2 1 1 1 1
##  [38] 1 4 1 2 2 1 2 1 1 3 2 2 2 1 4 1 2 1 2 2 1 3 1 2 1 2 4 1 1 1 2 1 2 3 4 1 1
##  [75] 2 1 2 2 2 1 2 1 1 1 1 2 1 1 2 3 1 1 4 1 1 2 2 1 3 1
##  [ reached getOption("max.print") -- omitted 132 entries ]
## Objective function:
##     build      swap 
## 0.8680776 0.8250323 
## 
## Available components:
##  [1] "medoids"    "id.med"     "clustering" "objective"  "isolation" 
##  [6] "clusinfo"   "silinfo"    "diss"       "call"       "data"
final <- final %>% mutate(cluster = as.factor(pam1$clustering))
ggplot(final, aes(x = food.environment.index, y = percentage.fully.vaccinated, 
    color = cluster)) + geom_point()

final %>% slice(pam1$id.med)
##   food.environment.index percentage.fully.vaccinated
## 1             -0.0933065                  -0.6051163
## 2              0.7322188                   0.2688540
## 3             -1.8360822                   0.5711513
## 4              0.7322188                   2.0288380
##   primary.care.physician.population cluster
## 1                        -0.1934691       1
## 2                        -0.2515207       2
## 3                        -0.2732901       3
## 4                         3.6379412       4
library(plotly)
final %>% plot_ly(x = ~food.environment.index, y = ~percentage.fully.vaccinated, 
    z = ~primary.care.physician.population, color = ~cluster, 
    type = "scatter3d", mode = "markers")
plot(pam1, which = 2)

library(GGally)
ggpairs(final, aes(color = cluster))

When I scaled and standardized all the data, I thought it was interesting to see that there were negative values within my data. I chose to do 4 clusters as it had the highest silhouette width. Cluster 4 seems to have all high values of food.environment.index, percentage of fully vaccinated people, and primary care physicians. Cluster 3 seems to have really low food environment index, medium level percentage of fully vaccinated people, and somewhat low primary care physicians. Cluster 2 has a medium food environment index, medium percentage of fully vaccinated people, and a low population of primary care physicians. Cluster 1 has low values of food environment index, fully vaccinated people, and primary care physicians. In the end, I don’t think the goodness-of-fit was that great, the highest average silhouette width was only .34 which means the structure is weak and may be artificial.

Dimensionality Reduction with PCA

countyfinal <- combinedcountyfinal %>% na.omit() %>% select(-urban) %>% 
    select_if(is.numeric) %>% scale
rownames(countyfinal) <- combinedcountyfinalnona$county
county_pca <- princomp(countyfinal, cor = T)
names(county_pca)
## [1] "sdev"     "loadings" "center"   "scale"    "n.obs"    "scores"   "call"
summary(county_pca, loadings = T)
## Importance of components:
##                          Comp.1    Comp.2    Comp.3      Comp.4      Comp.5
## Standard deviation     1.624027 1.1832634 0.9532786 0.201300882 0.114720132
## Proportion of Variance 0.527493 0.2800224 0.1817480 0.008104409 0.002632142
## Cumulative Proportion  0.527493 0.8075154 0.9892634 0.997367858 1.000000000
## 
## Loadings:
##                                              Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## food.environment.index                               0.428  0.903              
## primary.care.physician.population             0.507  0.440 -0.220        -0.708
## population.12.and.older                       0.507  0.438 -0.228         0.705
## percentage.fully.vaccinated                   0.499 -0.452  0.199 -0.711       
## percentage.vaccinated.with.at.least.one.dose  0.486 -0.477  0.209  0.702
countydf <- data.frame(County = combinedcountyfinalnona$county, 
    PC1 = county_pca$scores[, 1], PC2 = county_pca$scores[, 2])
ggplot(countydf, aes(PC1, PC2)) + geom_point()

I decided to retain PC1 and PC2 as they were enough to explain about 81% of the total variance in the dataset. For PC1, a high score means higher population of people 12 and older, as well as primary care physicians. It also means a larger percentage that are vaccinated with one dose and fully vaccinated. A lower score for PC1 means a lower population of people 12 and older and lower number of primary care physicians. It also means a lower percentage that are vaccinated with one dose and fully vaccinated. Interestingly, the food environment index does not play a big role in PC1.

For PC2 a high score means high food environment index, primary care physician population, population 12 and older, but a lower percentage fully vaccinated and lower percentage vaccinated with at least one dose. A low score for PC2 means lower food environment index, primary care physician, and population of people 12 and older. It also means higher percentage of people vaccinated with at least one dose (and fully vaccinated).

Linear Classifier

library(caret)

fit <- glm(urban ~ food.environment.index + primary.care.physician.population + 
    population.12.and.older + percentage.fully.vaccinated + percentage.vaccinated.with.at.least.one.dose, 
    data = combinedcountyfinalnona, family = "binomial")
coef(fit)
##                                  (Intercept) 
##                                -7.8100830137 
##                       food.environment.index 
##                                 0.6230329258 
##            primary.care.physician.population 
##                                -0.0221492676 
##                      population.12.and.older 
##                                 0.0000673533 
##                  percentage.fully.vaccinated 
##                                 0.1191668386 
## percentage.vaccinated.with.at.least.one.dose 
##                                -0.0928097189
probs <- predict(fit, type = "response")
probs %>% round(3)
##     1     2     3     4     5     6     7     8     9    10    11    12    13 
## 0.441 0.192 0.645 0.084 0.041 0.545 0.359 0.108 0.147 0.851 0.083 0.177 1.000 
##    14    15    16    17    18    19    20    21    22    23    24    25    26 
## 1.000 0.214 0.050 0.159 0.628 1.000 0.998 0.064 0.009 0.228 0.214 0.401 0.393 
##    27    28    29    30    31    32    33    34    35    36    37    38    39 
## 0.138 0.098 1.000 0.069 0.181 0.098 0.457 0.398 0.063 0.076 0.062 0.065 1.000 
##    40    41    42    43    44    45    46    47    48    49    50    51    52 
## 0.074 0.242 0.993 0.118 0.078 0.358 0.617 0.004 0.090 0.115 0.105 0.025 1.000 
##    53    54    55    56    57    58    59    60    61    62    63    64    65 
## 0.049 0.177 0.059 1.000 0.174 0.053 0.005 0.073 0.988 0.071 0.998 1.000 0.226 
##    66    67    68    69    70    71    72    73    74    75    76    77    78 
## 0.087 0.251 0.233 0.080 0.120 0.007 1.000 0.122 0.112 0.207 0.208 1.000 0.170 
##    79    80    81    82    83    84    85    86    87    88    89    90    91 
## 0.219 0.052 0.176 0.149 0.971 0.830 0.157 0.998 0.248 0.037 0.089 0.023 0.057 
##    92    93    94    95    96    97    98    99   100 
## 0.492 1.000 0.663 0.031 1.000 0.127 0.728 1.000 0.318 
##  [ reached getOption("max.print") -- omitted 132 entries ]
class_diag(probs, combinedcountyfinalnona$urban, positive = 1)
##            acc   sens   spec   ppv     f1     ba   auc
## Metrics 0.8922 0.7368 0.9679 0.918 0.8175 0.8524 0.881
table(truth = combinedcountyfinalnona$urban, prediction = probs > 
    0.5) %>% addmargins
##      prediction
## truth FALSE TRUE Sum
##   0     151    5 156
##   1      20   56  76
##   Sum   171   61 232
cv <- trainControl(method = "cv", number = 8, classProbs = T, 
    savePredictions = T)
fit <- train(urban ~ food.environment.index + primary.care.physician.population + 
    population.12.and.older + percentage.fully.vaccinated + percentage.vaccinated.with.at.least.one.dose, 
    data = combinedcountyfinalnona, trControl = cv, method = "glm")
class_diag(fit$pred$pred, fit$pred$obs, positive = 1)
##            acc   sens   spec  ppv     f1     ba    auc
## Metrics 0.7716 0.3158 0.9936 0.96 0.4752 0.6547 0.7613

The values is 1 for urban and 0 for rural. The model is pretty fair in predicting the new observations per CV AUC. There is definitely signs of overfitting as the CV AUC is less when compared to the regular AUC. The confusion matrix is interesting in that it mostly predicts urban counties correctly, but utterly fails as it tries to predicts rural counties.

Non-Parametric Classifier

library(caret)
knn_fit <- knn3(urban ~ food.environment.index + primary.care.physician.population + 
    population.12.and.older + percentage.fully.vaccinated + percentage.vaccinated.with.at.least.one.dose, 
    data = combinedcountyfinalnona, k = 8)
y_hat_knn <- predict(knn_fit, combinedcountyfinalnona)
y_hat_knn
##            0     1
##   [1,] 0.500 0.500
##   [2,] 0.750 0.250
##   [3,] 0.375 0.625
##   [4,] 0.750 0.250
##   [5,] 0.750 0.250
##   [6,] 0.625 0.375
##   [7,] 0.875 0.125
##   [8,] 1.000 0.000
##   [9,] 0.750 0.250
##  [10,] 0.375 0.625
##  [11,] 1.000 0.000
##  [12,] 1.000 0.000
##  [13,] 0.000 1.000
##  [14,] 0.000 1.000
##  [15,] 1.000 0.000
##  [16,] 1.000 0.000
##  [17,] 0.875 0.125
##  [18,] 0.250 0.750
##  [19,] 0.000 1.000
##  [20,] 0.000 1.000
##  [21,] 0.875 0.125
##  [22,] 0.750 0.250
##  [23,] 1.000 0.000
##  [24,] 0.875 0.125
##  [25,] 0.625 0.375
##  [26,] 0.625 0.375
##  [27,] 0.750 0.250
##  [28,] 0.875 0.125
##  [29,] 0.000 1.000
##  [30,] 1.000 0.000
##  [31,] 0.875 0.125
##  [32,] 1.000 0.000
##  [33,] 0.625 0.375
##  [34,] 0.625 0.375
##  [35,] 1.000 0.000
##  [36,] 0.875 0.125
##  [37,] 1.000 0.000
##  [38,] 1.000 0.000
##  [39,] 0.000 1.000
##  [40,] 1.000 0.000
##  [41,] 0.750 0.250
##  [42,] 0.000 1.000
##  [43,] 0.875 0.125
##  [44,] 1.000 0.000
##  [45,] 0.625 0.375
##  [46,] 0.500 0.500
##  [47,] 0.875 0.125
##  [48,] 1.000 0.000
##  [49,] 1.000 0.000
##  [50,] 0.625 0.375
##  [ reached getOption("max.print") -- omitted 182 rows ]
class_diag(y_hat_knn[, 2], combinedcountyfinalnona$urban, positive = 1)
##            acc   sens   spec    ppv    f1    ba    auc
## Metrics 0.8578 0.6053 0.9808 0.9388 0.736 0.793 0.9219
table(truth = factor(combinedcountyfinalnona$urban == 1, levels = c("TRUE", 
    "FALSE")), prediction = factor(y_hat_knn[, 2] > 0.5, levels = c("TRUE", 
    "FALSE"))) %>% addmargins
##        prediction
## truth   TRUE FALSE Sum
##   TRUE    46    30  76
##   FALSE    3   153 156
##   Sum     49   183 232
cv <- trainControl(method = "cv", number = 8, classProbs = T, 
    savePredictions = T)
fit <- train(urban ~ food.environment.index + primary.care.physician.population + 
    population.12.and.older + percentage.fully.vaccinated + percentage.vaccinated.with.at.least.one.dose, 
    data = combinedcountyfinalnona, trControl = cv, method = "knn")
class_diag(fit$pred$pred, fit$pred$obs, positive = 1)
##            acc  sens   spec   ppv     f1     ba   auc
## Metrics 0.8434 0.636 0.9444 0.848 0.7268 0.7902 0.839

The model is good at predicting new observations per CV AUC. There are signs of overfitting as the CV AUC is less than the regular AUC. The nonparametric model is actually better than the linear model in its cross-validation performance, the AUC is higher. The confusion matrix predicted more rural counties correctly than the linear classifier confusion matrix.

Regression/Numeric Prediction

fit <- lm(percentage.fully.vaccinated ~ primary.care.physician.population + 
    food.environment.index, data = combinedcountyfinalnona)
yhat <- predict(fit)
mean((combinedcountyfinalnona$percentage.fully.vaccinated - yhat)^2)
## [1] 113.1287
k = 8
data <- combinedcountyfinalnona[sample(nrow(combinedcountyfinalnona)), 
    ]
folds <- cut(seq(1:nrow(combinedcountyfinalnona)), breaks = k, 
    labels = F)
diags <- NULL
for (i in 1:k) {
    train <- data[folds != i, ]
    test <- data[folds == i, ]
    fit <- lm(percentage.fully.vaccinated ~ primary.care.physician.population + 
        food.environment.index, data = train)
    yhat <- predict(fit, newdata = test)
    diags <- mean((test$percentage.fully.vaccinated - yhat)^2)
}
mean(diags)
## [1] 160.6415

The MSE for the overall dataset was 113.1287. There is definitely signs of overfitting as the average MSE from my k testing folds were higher than the MSE for the overall dataset.

Python

library(reticulate)
hi <- "Hello"
combinedcounty %>% mutate(percentage.fully.vaccinated = (people.fully.vaccinated/population.12.and.older) * 
    100, percentage.vaccinated.with.at.least.one.dose = (people.vaccinated.with.at.least.one.dose/population.12.and.older) * 
    100)
## # A tibble: 254 x 11
##    county food.environmen… primary.care.ph… urban.rural.cla… total.doses.all…
##    <chr>             <dbl>            <dbl> <chr>                       <dbl>
##  1 Ander…              6.4               19 Rural                       17870
##  2 Andre…              7.8                9 Rural                        6100
##  3 Angel…              6.3               51 Rural                       58100
##  4 Arans…              5.4                9 Urban                        6840
##  5 Archer              7.7               NA Urban                        3200
##  6 Armst…              6.7                0 Urban                        2100
##  7 Atasc…              7.6               11 Urban                       19000
##  8 Austin              7.9                5 Urban                        8700
##  9 Bailey              7.9                2 Rural                       24135
## 10 Bande…              6.6                5 Urban                        5100
## # … with 244 more rows, and 6 more variables: vaccine.doses.administered <dbl>,
## #   people.vaccinated.with.at.least.one.dose <dbl>,
## #   people.fully.vaccinated <dbl>, population.12.and.older <dbl>,
## #   percentage.fully.vaccinated <dbl>,
## #   percentage.vaccinated.with.at.least.one.dose <dbl>
import pandas as pd
(r.combinedcounty.assign(PercentageofFullyVaccinated = lambda x: (r.combinedcounty["people.fully.vaccinated"]/r.combinedcounty["population.12.and.older"])*100)
.assign(PercentageofVaccinatedatleastonedose= lambda x:(r.combinedcounty["people.vaccinated.with.at.least.one.dose"]/r.combinedcounty["population.12.and.older"])*100)
.sort_values(by=('PercentageofFullyVaccinated'), ascending=True))
##             county                  ...                   PercentageofVaccinatedatleastonedose
## 134           King                  ...                                              26.976744
## 150         Loving                  ...                                              31.538462
## 175         Newton                  ...                                              29.357335
## 82          Gaines                  ...                                              29.461144
## 172         Motley                  ...                                              41.484301
## 147       Lipscomb                  ...                                              36.100815
## 196        Roberts                  ...                                              36.301370
## 43   Collingsworth                  ...                                              39.761513
## 62         Dickens                  ...                                              37.982501
## 16          Borden                  ...                                              38.488576
## 148       Live Oak                  ...                                              39.943209
## 203    San Jacinto                  ...                                              40.684029
## 210        Sherman                  ...                                              46.287328
## 158         Martin                  ...                                              44.198524
## 179         Oldham                  ...                                              40.918691
## 32          Carson                  ...                                              41.651883
## 33            Cass                  ...                                              41.760810
## 229         Upshur                  ...                                              41.968618
## 59           Delta                  ...                                              41.644325
## 66        Eastland                  ...                                              43.321368
## 103        Haskell                  ...                                              45.226329
## 113         Howard                  ...                                              42.378078
## 116     Hutchinson                  ...                                              43.139622
## 18           Bowie                  ...                                              44.025636
## 57          Dawson                  ...                                              43.875895
## 205       San Saba                  ...                                              43.947717
## 201         Sabine                  ...                                              42.878513
## 214       Stephens                  ...                                              42.430651
## 168       Montague                  ...                                              44.416032
## 189          Rains                  ...                                              43.239489
## ..             ...                  ...                                                    ...
## 56          Dallas                  ...                                              73.642414
## 45           Comal                  ...                                              72.816969
## 185          Pecos                  ...                                              74.151376
## 83       Galveston                  ...                                              71.119941
## 104           Hays                  ...                                              74.881504
## 123       Jim Hogg                  ...                                              81.373265
## 60          Denton                  ...                                              72.633149
## 129        Kendall                  ...                                              74.194345
## 141       La Salle                  ...                                              76.269878
## 136        Kleberg                  ...                                              76.823952
## 100         Harris                  ...                                              78.031637
## 14           Bexar                  ...                                              79.242652
## 244        Willacy                  ...                                              79.894824
## 65           Duval                  ...                                              81.879560
## 232      Val Verde                  ...                                              87.695806
## 114       Hudspeth                  ...                                              79.729403
## 23          Brooks                  ...                                              81.081554
## 63          Dimmit                  ...                                             106.534750
## 245     Williamson                  ...                                              80.266381
## 252         Zapata                  ...                                              82.400146
## 42          Collin                  ...                                              80.228136
## 226         Travis                  ...                                              82.485059
## 78       Fort Bend                  ...                                              86.433031
## 70         El Paso                  ...                                              90.212661
## 107        Hidalgo                  ...                                              94.436963
## 30         Cameron                  ...                                              95.904598
## 161       Maverick                  ...                                             114.542086
## 213          Starr                  ...                                              99.807476
## 239           Webb                  ...                                             112.245796
## 188       Presidio                  ...                                             119.195104
## 
## [254 rows x 11 columns]

In python, I was able to recalculate my dataset’s percentage of people fully vaccinated and percentage of people vaccinated with one dose through Python’s version of tidyverse!